Abstract

Although nonnegative matrix factorization (NMF) is NP-hard in general, it has been shown
very recently that it is tractable under the assumption that the input nonnegative data matrix
is separable (that is, there exists a cone spanned by a small subset of the columns containing
all columns). Since then, several algorithms have been designed to handle this subclass of NMF
problems. In particular, Bittorf, Recht, Ré and Tropp ('Factoring nonnegative matrices with linear
programs', NIPS 2012) proposed a linear programming model, referred to as HottTopixx. In this
paper, we provide a new and more general robustness analysis of their method. In particular, our
analysis is almost tight and allows duplicates and near duplicates in the dataset. Moreover, we
design a provably more robust variant using an appropriate post-processing strategy.

Keywords. Nonnegative matrix factorization, separability, robustness, linear programming,
HottTopixx.
1 Introduction

Nonnegative matrix factorization (NMF) is a popular machine learning technique that expresses
a set of nonnegative vectors as nonnegative linear combinations of nonnegative basis elements [9]. More
formally, given a nonnegative matrix $M \in \mathbb{R}^{m \times n}_+$ corresponding to $n$ vectors in an
$m$-dimensional space and a factorization rank $r$, the aim is to find a basis matrix
$U \in \mathbb{R}^{m \times r}_+$ and a weight matrix $V \in \mathbb{R}^{r \times n}_+$
such that the norm of the error $M - UV$ is minimized. Although NMF is NP-hard [10], Arora et
al. [1] recently showed that it can be solved in polynomial time given that the matrix $M$ is separable.
A nonnegative matrix $M \in \mathbb{R}^{m \times n}_+$ is $r$-separable if and only if it can be expressed as
$M = WH$, where $W \in \mathbb{R}^{m \times r}_+$, $H \in \mathbb{R}^{r \times n}_+$, and each column of $W$
is equal to a column of $M$. In other terms, $M \in \mathbb{R}^{m \times n}_+$ is $r$-separable if and only if

$$M = W[I_r, H']\Pi = [W, WH']\Pi,$$

for some $H' \in \mathbb{R}^{r \times (n-r)}_+$ and some permutation matrix $\Pi \in \{0,1\}^{n \times n}$.
Any nonnegative matrix is $n$-separable because of the trivial decomposition $M = MI$ with $r = n$, and the aim
is to find a decomposition where $r$ is as small as possible. It is rather straightforward to check that the
smallest such $r$ is the number of extreme rays of the cone generated by the columns of $M$, that is,
$\mathrm{cone}(M) = \{Mx \mid x \in \mathbb{R}^n_+\}$.
Equivalently, if the columns of matrix $M$ are normalized to sum to one, the smallest such $r$ is the number
of vertices of the convex hull of the columns of $M$, that is,
$\mathrm{conv}(M) = \{Mx \mid x \in \mathbb{R}^n_+, \sum_{i=1}^n x_i = 1\}$;
see [8] and the references therein for more details about the geometric interpretation of the separable
NMF problem.



*E-mail: nicolas.gillis@uclouvain.be. The author is a postdoctoral researcher of the Fonds de la Recherche Scientifique 
(F.R.S.-FNRS). This text presents research results of the Belgian Program on Interuniversity Poles of Attraction initiated 
by the Belgian State, Prime Minister's Office, Science Policy Programming. The scientific responsibility is assumed by 
the author. 
It turns out that the separability assumption makes sense in several practical situations. For
example, in document classification, each column of M corresponds to a document (that is, a vector 
of word counts) and is approximated with a nonnegative linear combination of the columns of matrix 
W which correspond to different topics (that is, bags of words). Separability of M requires that, for 
each topic, there exists at least one document discussing only that topic. Separability of $M^T$ (that
is, each row of $H$ appears as a row of $M$) requires that, for each topic, there exists at least one word
used only by that topic; see [TJ [2] and the references therein. The separability assumption is also 
widely used in hyperspectral imaging and is referred to as the pure-pixel assumption, see [7] and the 
references therein. 

In practice, the input separable matrix M is perturbed with some noise and it is therefore desirable 
to design robust algorithms; see [TJ El El El El El E] • In this paper, we will focus on the algorithm of 
Bittorf, Recht, Re and Tropp [3], referred to as HottTopixx, which is described in the next section. 
As we will see, the robustness result provided by the authors is rather restrictive as it does not allow 
near duplicates in the dataset: the aim of this paper is to develop a more general analysis of their 
algorithm. 



1.1 HottTopixx: a Linear Programming Model for Separable NMF 

From now on, we will always assume that the columns of the input data matrix $M$ have been normalized
in order to sum to one, that is,

(i) The zero columns of $M$ have been discarded, and

(ii) Each column of $M$ is updated using $M(:,j) \leftarrow \frac{M(:,j)}{\|M(:,j)\|_1}$.

We will also always assume that we are given a noisy separable matrix $\tilde{M} = M + N$ where
$N \in \mathbb{R}^{m \times n}$ is some noise added to the separable matrix $M$ such that

$$\|N\|_1 = \max_{\|x\|_1 \le 1} \|Nx\|_1 = \max_j \|N(:,j)\|_1 \le \epsilon, \quad \text{for some } \epsilon \ge 0.$$
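As a minimal sketch of the two conventions above (column normalization and the induced $\ell_1$ matrix norm used to measure the noise), the following plain Python snippet may help. The function names and the small matrices are hypothetical and only serve as an illustration.

```python
def normalize_columns(M):
    """Discard all-zero columns and rescale the remaining ones to sum to one."""
    m, n = len(M), len(M[0])
    cols = []
    for j in range(n):
        col = [M[i][j] for i in range(m)]
        s = sum(col)  # equals the l1-norm because M is assumed nonnegative
        if s > 0:
            cols.append([v / s for v in col])
    # Transpose the kept columns back into a list of rows.
    return [[c[i] for c in cols] for i in range(m)]


def matrix_l1_norm(N):
    """||N||_1 = max_j ||N(:, j)||_1, the largest column l1-norm."""
    m, n = len(N), len(N[0])
    return max(sum(abs(N[i][j]) for i in range(m)) for j in range(n))


# Hypothetical data: the middle column is zero and gets discarded.
M = [[2.0, 0.0, 1.0],
     [2.0, 0.0, 3.0]]
M_normalized = normalize_columns(M)  # columns (0.5, 0.5) and (0.25, 0.75)

# Hypothetical noise matrix; its norm plays the role of epsilon.
N = [[0.01, -0.02],
     [-0.01, 0.01]]
epsilon = matrix_l1_norm(N)
```

Here `epsilon` is the largest column sum of absolute values of `N`, which is the quantity bounded by $\epsilon$ in the noise model.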

The matrix $M$ is $r$-separable if and only if

$$M = WH = W[I_r, H']\Pi = [W, WH']\Pi = [W, WH'] \begin{pmatrix} I_r & H' \\ 0_{(n-r) \times r} & 0_{(n-r) \times (n-r)} \end{pmatrix} \Pi = MX^0, \qquad (1)$$

for some $H' \ge 0$ and some permutation matrix $\Pi$. Equation (1) shows that $M$ is $r$-separable if and
only if there exists a nonnegative matrix $X \in \mathbb{R}^{n \times n}_+$ such that: (1) $X$ contains
$(n - r)$ all-zero rows and the $r$-by-$r$ identity matrix as a submatrix (up to permutation), and
(2) $M = MX$. Notice that because the columns of the matrices $M$ and $W$ sum to one, the columns of the
matrix $H'$ sum to one as well. Based on this observation, Bittorf et al. [3] proposed to solve the
following optimization problem$^1$ in order to approximately identify the columns of the matrix $W$ among
the columns of the matrix $\tilde{M}$:

$$\begin{aligned}
\min_{X \in \mathbb{R}^{n \times n}_+} \quad & p^T \mathrm{diag}(X) \\
\text{such that} \quad & \|\tilde{M} - \tilde{M}X\|_1 \le 2\epsilon, \\
& \mathrm{tr}(X) = r, \qquad\qquad\qquad (2) \\
& X(i,i) \le 1 \text{ for all } i, \\
& X(i,j) \le X(i,i) \text{ for all } i, j,
\end{aligned}$$



1 In [3], the model assumes separability of $M^T$ so that (2) is equivalent to the model in [5] applied to $M^T$. We prefer
here to work with the columns.
where $p \in \mathbb{R}^n$ is any vector with distinct entries. Intuitively, the model reads as follows: we have
to assign a weight to each column of $\tilde{M}$ (that is, give a value to $X(i,i)$ for all $i$) for a total weight
of $r$. Moreover, we cannot use a column to reconstruct another column with a weight larger than the
corresponding diagonal entry of $X$, while we have to guarantee that the approximation error is small.
It is interesting to notice that the problem is always feasible: in fact,

$$X^0 = \Pi^T \begin{pmatrix} I_r & H' \\ 0_{(n-r) \times r} & 0_{(n-r) \times (n-r)} \end{pmatrix} \Pi, \quad \text{with } M = MX^0,$$

is a feasible solution of (2) since the columns of $H'$ sum to one and

$$\|\tilde{M} - \tilde{M}X^0\|_1 = \|(M + N) - (M + N)X^0\|_1 \le \|M - MX^0\|_1 + \|N\|_1 + \|NX^0\|_1 \le 2\epsilon.$$

Finally, Bittorf et al. [3] proposed Algorithm 1 to approximately identify the columns of $W$ among
the columns of $\tilde{M}$, while the corresponding optimal weight matrix $H$ can be obtained by solving
another linear program; see Algorithm 2.
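The feasibility argument above can be checked numerically on a tiny instance. The sketch below is only an illustration under stated assumptions: the matrices `W`, `H_prime` and `N` are hypothetical, the permutation $\Pi$ is taken to be the identity, and the helper functions are plain Python stand-ins for matrix operations.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matsub(A, B):
    """Entrywise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def matrix_l1_norm(A):
    """Largest column l1-norm."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))


r, n = 2, 4
W = [[0.8, 0.1],
     [0.2, 0.9]]                       # columns sum to one
H_prime = [[0.5, 0.25],
           [0.5, 0.75]]                # columns sum to one
H = [[1.0, 0.0] + H_prime[0],
     [0.0, 1.0] + H_prime[1]]          # H = [I_r, H'] (Pi = identity)
M = matmul(W, H)
X0 = H + [[0.0] * n for _ in range(n - r)]   # [I_r H'; 0 0]

eps = 0.05
N = [[eps / 2, -eps / 2, 0.0, eps / 2],
     [-eps / 2, eps / 2, eps, -eps / 2]]     # every column has l1-norm eps
M_noisy = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(M, N)]

exact_residual = matrix_l1_norm(matsub(M, matmul(M, X0)))
noisy_residual = matrix_l1_norm(matsub(M_noisy, matmul(M_noisy, X0)))
trace = sum(X0[i][i] for i in range(n))
rows_dominated = all(X0[i][j] <= X0[i][i] for i in range(n) for j in range(n))
```

On this instance `exact_residual` vanishes up to rounding, `noisy_residual` stays below `2 * eps`, the trace equals `r`, and every row of `X0` is dominated by its diagonal entry, which are exactly the constraints of the model.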

Algorithm 1 HottTopixx - Extracting Columns of a Separable Matrix by Linear Programming [3]

Input: A noisy $r$-separable matrix $\tilde{M} = WH + N$, where $\|N\|_1 \le \epsilon$ and $W$ is $\alpha$-robustly simplicial.
Output: A matrix $\tilde{W}$ such that $\|\tilde{W}(:,P) - W\|_1 \le \delta$ for some permutation $P$ and some $\delta \ge 0$.

1: Find the optimal solution $X^*$ of (2).
2: Let $\mathcal{K}$ be the index set corresponding to the $r$ largest diagonal entries of $X^*$.
3: Set $\tilde{W} = \tilde{M}(:,\mathcal{K})$.
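Steps 2 and 3 of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration: the function name `select_columns` is hypothetical, the matrix `X_opt` below is made-up data standing in for an optimal solution of (2), and solving the linear program itself (step 1) is left to any LP solver.

```python
def select_columns(M_noisy, X_opt, r):
    """Pick the columns of M_noisy matching the r largest diagonal entries of X_opt."""
    n = len(X_opt)
    diag = [X_opt[i][i] for i in range(n)]
    # Indices sorted by decreasing diagonal weight; keep the first r.
    order = sorted(range(n), key=lambda i: diag[i], reverse=True)
    K = sorted(order[:r])
    W_est = [[row[j] for j in K] for row in M_noisy]
    return K, W_est


# Hypothetical noisy data matrix (2 rows, 4 columns).
M_noisy = [[0.79, 0.45, 0.12, 0.30],
           [0.21, 0.55, 0.88, 0.70]]
# Hypothetical LP solution; only its diagonal matters for the selection.
X_opt = [[0.90, 0.00, 0.00, 0.00],
         [0.00, 0.05, 0.00, 0.00],
         [0.00, 0.00, 0.95, 0.00],
         [0.00, 0.00, 0.00, 0.10]]

K, W_est = select_columns(M_noisy, X_opt, r=2)
```

With these numbers the first and third columns carry the largest diagonal weights, so they are the ones returned in `W_est`.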



Algorithm 2 Approximately Separable NMF using HottTopixx and Linear Programming [3]

Input: A noisy $r$-separable matrix $\tilde{M} = WH + N$, where $\|N\|_1 \le \epsilon$ and $W$ is $\alpha$-robustly simplicial.
Output: A nonnegative factorization $(\tilde{W}, \tilde{H})$ such that $\|\tilde{M} - \tilde{W}\tilde{H}\|_1 \le \epsilon + \delta$.

1: Compute $\tilde{W}$ using Algorithm 1.
2: Solve $\tilde{H} = \mathrm{argmin}_{Y \ge 0} \|\tilde{M} - \tilde{W}Y\|_1$.



Before stating robustness results, it is important to define the conditioning of matrix $W$, which is
a crucial characteristic of separable NMF problems. In fact, the better the columns of $W$ are spread
in the unit simplex $\Delta^m = \{x \in \mathbb{R}^m \mid x \ge 0, \sum_{i=1}^m x_i = 1\}$, the more noise tolerant the data will be.
In [U [3], this conditioning is measured via the following parameter: 

$$\alpha = \min_{1 \le k \le r} \min_{x \in \Delta^{r-1}} \|W(:,k) - W(:,\mathcal{K})x\|_1, \quad \text{where } \mathcal{K} = \{1, 2, \dots, r\} \setminus \{k\},$$

and the matrix $W$ is said to be $\alpha$-robustly simplicial. (Notice that $\alpha \le 2$ for any nonnegative matrix
$W$ whose columns sum to one.) In other words, $\alpha$ is the minimum among the $\ell_1$-distances between
a column of $W$ and the convex hull of the other columns of $W$. It is necessary that $\|N\|_1 \le \epsilon < \frac{\alpha}{2}$
for any separable NMF algorithm to be able to approximately recover the columns of $W$ from the
matrix $\tilde{M} = WH + N$. In fact, if $\epsilon \ge \frac{\alpha}{2}$, any $r$-separable matrix $M$ with $r \ge 2$ can be perturbed so
that one of the columns of the perturbed matrix $\tilde{M}$ corresponding to a column of $W$ belongs to the
convex hull of the other columns. In other words, we can perturb the matrix $M$ so that it becomes
$(r-1)$-separable and we could therefore not distinguish one of the columns of $W$ from the columns
of $\tilde{M}$. For example, with

W = ( (i -i) e T ) ,H = Ir and N = ( ) ' We haVe = ( 
so that the matrix $M = WH$ is $r$-separable while $\tilde{M}$ is 1-separable.

In order to prove robustness results for Algorithm^! Bittorf et al. [3] used the following observation: 

Lemma 1. Suppose $M$ is normalized and admits a rank-$r$ separable factorization $WH$, and suppose
$\tilde{M} = M + N$ with $\|N\|_1 \le \epsilon$. If $\tilde{W}$ is such that $\|\tilde{W}(:,P) - W\|_1 \le \delta$ for some $\delta \ge 0$ and some
permutation $P$, then Algorithm 2 constructs a factorization $(\tilde{W}, \tilde{H})$ satisfying $\|\tilde{M} - \tilde{W}\tilde{H}\|_1 \le \epsilon + \delta$.

Proof. Denoting $N_W = \tilde{W}(:,P) - W$, we have

$$\begin{aligned}
\|\tilde{M} - \tilde{W}\tilde{H}\|_1 &= \min_{Y \ge 0} \|M + N - (W + N_W)Y\|_1 \\
&\le \|M + N - (W + N_W)H\|_1 \\
&\le \|N\|_1 + \|N_W H\|_1 + \|M - WH\|_1 \\
&\le \epsilon + \delta,
\end{aligned}$$

since the columns of $H$ sum to one. □

Lemma 1 allows us to focus on proving robustness results for Algorithm 1. In fact, any result
that applies to Algorithm 1 directly applies to Algorithm 2. In this paper, we will therefore focus our
attention on Algorithm 1, as was implicitly done in [3J [3].

In the first version of their paper [3], two different robustness results were proposed. The first
one assumes that there are no duplicates or near duplicates of the columns of $W$ in the dataset, which
amounts to requiring the maximum entry of $H'$ in Equation (1) to be bounded above:

Proposition 1. Suppose $\tilde{M} = M + N$ where $M$ is normalized, admits a rank-$r$ separable factorization
$WH$ with $W$ $\alpha$-robustly simplicial and $\|N\|_1 \le \epsilon$, and has the form (1) with
$\|H'\|_\infty = \max_{i,j} |H'_{ij}| \le \beta < 1$ $^2$. Suppose $\epsilon \le f(\alpha, \beta, r)$ for some function $f$.
Then Algorithm 1 extracts a matrix $\tilde{W}$ satisfying $\|W - \tilde{W}(:,P)\|_1 \le \epsilon$ for some permutation $P$.

The second one is general as it does not make any assumption on the matrix $H'$:

Proposition 2. Suppose $\tilde{M} = M + N$ where $M$ is normalized, and admits a rank-$r$ separable
factorization $WH$ with $W$ $\alpha$-robustly simplicial and $\|N\|_1 \le \epsilon$. Suppose $\epsilon \le g(\alpha, r)$ for some function $g$.
Then Algorithm 1 extracts a matrix $\tilde{W}$ satisfying $\|W - \tilde{W}(:,P)\|_1 \le \delta$ for some permutation $P$ and
some $\delta \ge 0$.

The aim is to find functions $f$ and $g$ as large as possible and $\delta$ as small as possible such that
the propositions above hold. In [U Prop. 3.3] (resp. [4j Prop. 3.2]) authors originally claimed that 
f(a,fi,r) = \ol(1 — f3) (resp. g(a,r) = 8 " 4a and 6 = 4e) gives the result. Unfortunately, there were 
some flaws in the proofs, as is shown by the following counter example. Let 
2 Usually $\|A\|_p = \max_{\|x\|_p \le 1} \|Ax\|_p$, hence $\|A\|_\infty$ should refer to $\max_i \|A(i,:)\|_1$, but we assign it a different meaning
in this paper.
is an optimal solution of (2) with $p = (1, 2, 3, -1)^T$, so that the last column of $\tilde{M}$ is selected by
Algorithm 1, which gives

$$\|\tilde{W}(:,P) - W\|_1 = 2 > 4\epsilon = 0.6 \quad \text{and} \quad \min_{H \ge 0} \|\tilde{M} - \tilde{W}H\|_1 = 1 > 4\epsilon,$$

for any permutation $P$, a contradiction with Propositions 3.2 and 3.3 in [TJ.

In the final version [3], only the following robustness result is proposed : 

Proposition 3 ([3], Prop. 3.2). Suppose M = M + N where M is normalized, and admits a rank-r 
separable factorization WH with W a-robustly simplicial and \\N\\\ < e. Suppose for each column 
M(:,j) 1 < j < n either M(:,j) = W(:,k) for some 1 < k < r or \\M(:,j) - W(:,k)\\i > d for all 
1 < k < r. If e < ^f 1 and do < ^, then Algorithm^ extracts a matrix W satisfying min#>o ||M — 
WH\\ 1 <2e. 

By Lemma 1 and the fact that (see Lemma 4)

$$\|M(:,j) - W(:,k)\|_1 \ge d_0 \text{ for all } 1 \le k \le r \quad \Rightarrow \quad \|H(:,j)\|_\infty \le \beta = 1 - \frac{d_0}{2},$$

Proposition 3 is essentially equivalent to Proposition 1. The only differences are that (1) Proposition 3
allows duplicates of the columns of W in the dataset, and (2) Proposition [JJ does not need any 
assumption^! on /3. 

Finally, the robustness result proposed in [3] only deals with input data matrices without near 
duplicates. In this paper, we propose a variant of HottTopixx (Algorithm [3]) which is more robust, 
and applies to any separable matrix. 

1.2 Conditioning and $\kappa$-Robustly Conical Matrices

Because the columns of the variable $X$ in (2) are not required to sum to one (note that this constraint
could be added to the model while keeping linearity), it turns out that it will be easier to work with
the following parameter measuring the conditioning of matrix $W$:

$$\kappa = \min_{1 \le k \le r} \min_{x \in \mathbb{R}^{r-1}_+} \|W(:,k) - W(:,\mathcal{K})x\|_1, \quad \text{where } \mathcal{K} = \{1, 2, \dots, r\} \setminus \{k\},$$

and the matrix $W$ is said to be $\kappa$-robustly conical. We have that $\kappa$ is the minimum among the
$\ell_1$-distances between a column of $W$ and the convex cone generated by the other columns of $W$. If the
columns of $W$ sum to one (which will always be assumed), $\kappa \le 1$ and we can relate $\alpha$ and $\kappa$ as follows:

Theorem 1. For any $\alpha$-robustly simplicial and $\kappa$-robustly conical nonnegative matrix $W$ whose columns
sum to one, we have

$$\kappa \le \alpha \le 2\kappa.$$

Proof. The first inequality follows directly from the definition. The second is proved in Appendix A. □

Therefore, it is essentially equivalent to work with $\alpha$ or $\kappa$ as they differ by a multiplicative
factor of at most 2.
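To make the two conditioning parameters concrete, here is a minimal sketch restricted to the special case $r = 2$, where both can be computed without a linear programming solver: the convex hull of the other column is that column itself, and the cone it generates is a ray. The helper names and the matrix are hypothetical; for general $r$ each inner minimization would instead require solving a small linear program.

```python
def l1_dist(a, b):
    """l1-distance between two vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def l1_dist_to_ray(a, b):
    """min over t >= 0 of ||a - t*b||_1 for nonnegative vectors a and b.

    The objective is convex and piecewise linear in t, so its minimum over
    t >= 0 is attained at t = 0 or at one of the breakpoints a_i / b_i.
    """
    candidates = [0.0] + [ai / bi for ai, bi in zip(a, b) if bi > 0]
    return min(sum(abs(ai - t * bi) for ai, bi in zip(a, b)) for t in candidates)


# Hypothetical matrix W with r = 2 columns that sum to one.
w1 = [0.8, 0.2]
w2 = [0.1, 0.9]

alpha = l1_dist(w1, w2)  # distance to the convex hull of the other column
kappa = min(l1_dist_to_ray(w1, w2), l1_dist_to_ray(w2, w1))  # distance to the cone
```

For this example `alpha` is 1.4 and `kappa` is 7/9, in line with the inequalities $\kappa \le \alpha \le 2\kappa$ of Theorem 1.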

3 At the time we write this paper, the proof of Proposition [3] is not publicly available, and it is not clear why the 
condition do < § is needed. In fact, it seems that Proposition [3] contradicts Theorem 2] a more natural condition on do 
would be a lower bound, such as do > § (which would be rather restrictive as one of the columns of W is at distance a 
of the convex hull of the other columns of W). 
1.3 Contribution and Outline of the Paper

In this paper, we provide a new robustness analysis of Algorithm 1. In Section 2, we focus on
Proposition 1 and prove that

• $\epsilon \le \frac{\kappa(1-\beta)}{7(r+1)}$ is sufficient for Proposition 1 to hold (Theorem 2), while

• $\epsilon < \frac{\kappa(1-\beta)}{(r-1)(1-\beta)+1}$ is necessary for Proposition 1 to hold for any $r \ge 3$ and $\beta < 1$ (Theorem 3).

This shows that our analysis is close to being tight. Moreover, the second condition on $\epsilon$ implies
that it is necessary for Proposition 2 to hold that $\epsilon < \frac{\kappa}{r-1}$ (Corollary 1). In Section 3, we focus on
Proposition [2] and first show that it is necessary that 5 > 3— + |e for any e < % (Theorem^. (We 
also show that this lower bound on $\delta$ applies to a broader class of separable NMF algorithms.) Then,
we propose a post-processing of the solution of the linear program (2) (see Algorithm 3) for which the
following result holds:

(Theorem 5) Let $M = WH$ be a normalized $r$-separable matrix with $W$ $\kappa$-robustly conical. Let also
$\tilde{M} = M + N$ with $\|N\|_1 \le \epsilon$. If

$$\epsilon \le \frac{\omega\kappa}{74(r+1)},$$

where $\omega = \min_{i \ne j} \|W(:,i) - W(:,j)\|_1$, then Algorithm 3 extracts a matrix $\tilde{W}$ such that

$$\|\tilde{W} - W(:,P)\|_1 \le 37(r+1)\frac{\epsilon}{\kappa} + 2\epsilon, \quad \text{for some permutation } P.$$

Because the necessary condition $\epsilon < \frac{\kappa}{r-1}$ also applies to Algorithm 3, the bound for $\epsilon$ of Theorem 5 is
tight up to a factor $\omega$ (and some constant multiplicative factor). Moreover, because of Theorem 4, the
bound for $\delta$ of Theorem 5 is tight up to a factor $r$ (and some constant multiplicative factor). Finally,
we show that it is necessary for Proposition [2] to hold that e < ^z3[p f° r an y S < k + e (Theorem [6|) 
which demonstrates that HottTopixx cannot achieve a better bound than Algorithm 

1.4 Notation 

The set of $m$-by-$n$ real matrices is denoted $\mathbb{R}^{m \times n}$; for $A \in \mathbb{R}^{m \times n}$, we denote the $j$th column of $A$ by
$A(:,j)$, the $i$th row of $A$ by $A(i,:)$, and the entry at position $(i,j)$ by $A(i,j)$; for $b \in \mathbb{R}^{m \times 1} = \mathbb{R}^m$,
we denote the $i$th entry of $b$ by $b(i)$. Notation $A(\mathcal{I}, \mathcal{J})$ refers to the submatrix of $A$ with row and
column indices respectively in $\mathcal{I}$ and $\mathcal{J}$. The matrix $A^T$ is the transpose of $A$. The $\ell_1$-norm $\|.\|_1$ of a
vector is defined as $\|b\|_1 = \sum_i |b(i)|$ and of a matrix as $\|A\|_1 = \max_j \|A(:,j)\|_1$. We will denote by
$E_n$ the $n$-by-$n$ matrix of all-ones, $0_{m \times n}$ the $m$-by-$n$ matrix of all-zeros, and $I_n$ the $n$-by-$n$ identity
matrix. We will also denote by $e_i$ the $i$th column of the identity matrix, $e$ the all-one vector and $0$ the
all-zero vector; their dimensions will be clear from the context. The vector of the diagonal entries of
a matrix $A$ is denoted $\mathrm{diag}(A)$ while its trace is denoted $\mathrm{tr}(A) = e^T \mathrm{diag}(A)$. For a set $\mathcal{K}$, $|\mathcal{K}|$ denotes
its cardinality.

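2 Analysis of Proposition 1

In this section, we show that $\epsilon \le \frac{\kappa(1-\beta)}{7(r+1)}$ is sufficient for Proposition 1 to hold (Theorem 2), while
$\epsilon < \frac{\kappa(1-\beta)}{(r-1)(1-\beta)+1}$ is necessary for any $r \ge 3$ and $\beta < 1$ (Theorem 3).

Lemma 2. Suppose $\tilde{M} = M + N$ where $M$ is normalized and $\|N\|_1 \le \epsilon < 1$, and suppose $X$ is a
feasible solution of (2). Then, for all $1 \le j \le n$,

$$\|X(:,j)\|_1 \le 1 + \frac{2\epsilon}{1-\epsilon} \quad \text{and} \quad \|M(:,j) - MX(:,j)\|_1 \le 2\epsilon\left(\frac{2-\epsilon}{1-\epsilon}\right).$$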
Proof. For all $1 \le j \le n$, $\|\tilde{M}(:,j) - \tilde{M}X(:,j)\|_1 \le 2\epsilon$, and

$$1 - \epsilon \le \|M(:,j)\|_1 - \|N(:,j)\|_1 \le \|M(:,j) + N(:,j)\|_1 = \|\tilde{M}(:,j)\|_1 \le \|M(:,j)\|_1 + \|N(:,j)\|_1 \le 1 + \epsilon.$$

We then have

2e > ||M(:,i) - MX(:, j)||i > |(MI(:,j)||i - 

> ||MX(:,j)|| 1 -||M|| 1 

> ||Af|| 1 (||X(:,j)|| 1 -1) 
>(l-6)(||X(:,j)|j 1 -l), 

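hence $\|X(:,j)\|_1 \le 1 + \frac{2\epsilon}{1-\epsilon}$, implying that $\|NX(:,j)\|_1 \le \|N\|_1 \|X(:,j)\|_1 \le \epsilon\left(1 + \frac{2\epsilon}{1-\epsilon}\right)$. Therefore,

$$\begin{aligned}
2\epsilon \ge \|\tilde{M}(:,j) - \tilde{M}X(:,j)\|_1 &= \|M(:,j) + N(:,j) - (M + N)X(:,j)\|_1 \\
&\ge \|M(:,j) - MX(:,j)\|_1 - \epsilon - \epsilon\left(1 + \frac{2\epsilon}{1-\epsilon}\right),
\end{aligned}$$

from which we obtain $\|M(:,j) - MX(:,j)\|_1 \le 2\epsilon\left(2 + \frac{\epsilon}{1-\epsilon}\right) = 2\epsilon\left(\frac{2-\epsilon}{1-\epsilon}\right)$. □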

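Lemma 3. Let $\tilde{M} = M + N$ where $M$ is normalized, admits a rank-$r$ separable factorization $WH$
with $W$ $\kappa$-robustly conical and $\|N\|_1 \le \epsilon$, and has the form (1) with $\|H'\|_\infty = \max_{i,j} |H'_{ij}| \le \beta < 1$.
Let also $X$ be any feasible solution of (2), then

$$X(j,j) \ge 1 - \frac{2\epsilon}{\kappa(1-\beta)}\left(\frac{3-\epsilon}{1-\epsilon}\right) \quad \text{for all } j \text{ such that } M(:,j) = W(:,k) \text{ for some } 1 \le k \le r.$$

Proof. Let $\mathcal{K}$ be the set of $r$ indices such that $M(:,\mathcal{K}) = W$. Let also $1 \le k \le r$ and denote $j = \mathcal{K}(k)$
so that $M(:,j) = W(:,k)$. By Lemma 2,

$$\|W(:,k) - WHX(:,j)\|_1 \le 2\epsilon\left(\frac{2-\epsilon}{1-\epsilon}\right). \qquad (3)$$

Since $H(k,j) = 1$,

$$WHX(:,j) = W(:,k)H(k,:)X(:,j) + W(:,\mathcal{R})H(\mathcal{R},:)X(:,j) = W(:,k)\left(X(j,j) + H(k,\mathcal{J})X(\mathcal{J},j)\right) + W(:,\mathcal{R})y,$$

where $\mathcal{R} = \{1, 2, \dots, r\} \setminus \{k\}$, $\mathcal{J} = \{1, 2, \dots, n\} \setminus \{j\}$ and $y = H(\mathcal{R},:)X(:,j) \ge 0$. We have

$$\eta = X(j,j) + H(k,\mathcal{J})X(\mathcal{J},j) \le X(j,j) + \beta\left(1 + \frac{2\epsilon}{1-\epsilon} - X(j,j)\right), \qquad (4)$$

since $\|H(k,\mathcal{J})\|_\infty \le \beta$ and $\|X(:,j)\|_1 \le 1 + \frac{2\epsilon}{1-\epsilon}$ (Lemma 2). Hence

$$\|W(:,k) - WHX(:,j)\|_1 \ge (1-\eta)\left\|W(:,k) - W(:,\mathcal{R})\frac{y}{1-\eta}\right\|_1 \ge (1-\eta)\kappa. \qquad (5)$$

Combining Equations (3), (4) and (5), we obtain

$$1 - \left(X(j,j) + \beta\left(1 + \frac{2\epsilon}{1-\epsilon} - X(j,j)\right)\right) \le \frac{2\epsilon}{\kappa}\left(\frac{2-\epsilon}{1-\epsilon}\right),$$

which gives, using the fact that $\kappa\beta \le 1$,

$$X(j,j) \ge 1 - \frac{2\epsilon}{\kappa(1-\beta)}\left(\frac{3-\epsilon}{1-\epsilon}\right).$$

□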
Theorem 2. It is sufficient for Proposition 1 to hold that

$$\epsilon \le \frac{\kappa(1-\beta)}{7(r+1)}.$$

Proof. If $\epsilon = 0$, the proof is given in [3, Prop. 3.1]: for each $1 \le k \le r$, there exists $1 \le j \le n$ such that
$M(:,j) = W(:,k)$ and $X(j,j) = 1$ (this follows easily from the fact that the entries of $p$ are distinct).
Otherwise $\epsilon > 0$ and $\beta < 1$. Let $X$ be a feasible solution of (2) (which always exists since the feasible
set of (2) is non-empty). If we prove that the $r$ diagonal entries of $X$ corresponding to the columns of
$W$ are larger than all the other ones (because $\beta < 1$, there are no duplicates of the columns of $W$ in the
dataset), then we are done. In fact, these columns will then be identified by Algorithm 1 and we will
have $\|W - \tilde{W}(:,P)\|_1 \le \epsilon$ for some permutation $P$. (Notice that we do not need an optimal solution:
any feasible solution identifies the columns of $W$.)

Let $\mathcal{K}$ be the set of $r$ indices such that $M(:,\mathcal{K}) = W$. Assume that

$$X(k,k) > \frac{r}{r+1} \quad \text{for all } k \in \mathcal{K}. \qquad (6)$$

Since $\mathrm{tr}(X) = r$ and $X \ge 0$, we have

$$\sum_{j \notin \mathcal{K}} X(j,j) = r - \sum_{k \in \mathcal{K}} X(k,k) < r - r\frac{r}{r+1} = \frac{r}{r+1} < X(k,k) \quad \text{for all } k \in \mathcal{K},$$

implying that $X(j,j) < X(k,k)$ for all $k \in \mathcal{K}$, $j \notin \mathcal{K}$, which gives the result. It remains to show that
(6) holds. By Lemma 3,

$$X(k,k) \ge 1 - \frac{2\epsilon}{\kappa(1-\beta)}\left(\frac{3-\epsilon}{1-\epsilon}\right) \ge 1 - \frac{7\epsilon}{\kappa(1-\beta)},$$

since $\frac{3-\epsilon}{1-\epsilon} \le 3.5$ for any $\epsilon \le \frac{\kappa(1-\beta)}{7(r+1)} \le \frac{1}{14}$ as $0 \le \kappa, \beta \le 1$ and $r \ge 1$. Finally, for $\epsilon \le \frac{\kappa(1-\beta)}{7(r+1)}$,
$X(k,k) > \frac{r}{r+1}$ and the proof is complete. □

Remark 1. The proof of Theorem 2 actually does not make use of the constraints $X(i,j) \le X(i,i)$.
The reason is that the assumption $\|H'\|_\infty \le \beta < 1$ implies that there is no duplicate of the columns of
$W$ in the dataset (if $\beta = 1$, $\epsilon = 0$ and Algorithm 1 is guaranteed to work [3, Prop. 3.1]). This implies
that, in order to reconstruct each column of $W$ sufficiently well, the corresponding diagonal entry
of $X$ must be large independently of the other entries of the corresponding column of $X$.

Therefore, in case we know there is no duplicate in the dataset ( or have used some pre-processing to 
remove them), these constraints can be discarded (a similar observation was made in l^\J$). Moreover, 
since Theorem 2 only requires feasibility in that case, any feasible solution of the corresponding relaxed
linear program will correctly identify the columns of $W$.

Theorem 3. For Proposition 1 to hold when $r \ge 3$ and $\beta < 1$, it is necessary that

$$\epsilon < \frac{\kappa(1-\beta)}{(r-1)(1-\beta)+1}. \qquad (7)$$

Proof. See Appendix B. □

Theorem 3 shows that the sufficient condition derived in Theorem 2 is close to being tight. In
particular, if $r$ is assumed to be bounded above, then it is tight up to some constant multiplicative
factor (in practice $r$ is often assumed to be small). We believe it is possible to improve the bound of
Theorem 2 to match the one of Theorem 3 (up to some constant multiplicative factor). Unfortunately,
we were not able to derive such a sufficient condition; this is a topic for further research.
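The gap between the two conditions can be quantified with a short computation. The sketch below assumes the bounds read $\kappa(1-\beta)/(7(r+1))$ for the sufficient condition (Theorem 2) and $\kappa(1-\beta)/((r-1)(1-\beta)+1)$ for the necessary one (Theorem 3); the function names and the sample values of $\kappa$, $\beta$ and $r$ are hypothetical. Exact rational arithmetic is used so that the ratios are not affected by rounding.

```python
from fractions import Fraction


def sufficient_bound(kappa, beta, r):
    """Noise level below which Theorem 2 guarantees recovery (as read above)."""
    return kappa * (1 - beta) / (7 * (r + 1))


def necessary_bound(kappa, beta, r):
    """Noise level that Theorem 3 shows cannot be reached (as read above)."""
    return kappa * (1 - beta) / ((r - 1) * (1 - beta) + 1)


kappa, beta = Fraction(1, 2), Fraction(1, 2)   # hypothetical conditioning values
ratios = {}
for r in (3, 5, 10):
    s = sufficient_bound(kappa, beta, r)
    nb = necessary_bound(kappa, beta, r)
    # The ratio equals 7(r+1) / ((r-1)(1-beta)+1), independent of kappa.
    ratios[r] = nb / s
    print(r, s, nb, ratios[r])
```

Since the denominator $(r-1)(1-\beta)+1$ is at least one, the ratio never exceeds $7(r+1)$, which is the sense in which the analysis is tight up to a constant factor when $r$ is bounded.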
Corollary 1. For Proposition^ to hold for any 5 < f and r > 3, it is necessary that

$$\epsilon < \frac{\kappa}{r-1}.$$

Proof. In fact,

$$\epsilon < \frac{\kappa(1-\beta)}{(1-\beta)(r-1)+1} \le \frac{\kappa(1-\beta)}{(1-\beta)(r-1)} = \frac{\kappa}{r-1},$$

while the matrix M = WH + TV constructed in the proof of Theorem [3] satisfies \\W — W(:, P)\\i > 
> § where is the matrix extracted by Algorithm Q] and P is any permutation. □ 

Remark 2 (Cases $r = 1, 2$). Theorem 3 does not apply when $r = 1, 2$ because:

• The rank-one separable NMF problem is trivial. In fact, if $M$ admits a rank-one separable
factorization $wh^T$ and $\tilde{M} = M + N$ with $\|N\|_1 \le \epsilon$, then $\|\tilde{M}(:,j) - w\|_1 \le \epsilon$ for all $j$.

• The rank-two case is particular because it is not possible to construct very bad instances. In
fact, all rank-two separable NMF problems are essentially equivalent to each other because the
columns of $M$ belong to the line segment $[W(:,1), W(:,2)]$.



3 Analysis of Proposition 2 and Post-Processing Strategy

In this section, we investigate Proposition 2 and propose a variant of HottTopixx (see Algorithm 3)
which is more robust and applicable to any noisy separable matrix. In Section 3.1, we present a
simple necessary condition for Proposition 2 to hold. In Section 3.2, we show that, for each column
of $W$, there is a subset of the columns of $M$ close to that column of $W$ such that the sum of the
corresponding diagonal entries of any feasible solution $X$ of (2) is larger than $\frac{r}{r+1}$. Therefore, using an
appropriate post-processing of the solution $X$ of (2) (see Algorithm 3), we can approximately recover
the columns of $W$, given that the noise level $\epsilon$ is smaller than some upper bound. Finally, we show in
Section 3.3 that HottTopixx (Algorithm 1) cannot achieve this bound, which shows that Algorithm 3
is more robust.



3.1 Preliminary Necessary Conditions 

Recall that the aim is to identify, among the columns of $\tilde{M}$, $r$ columns gathered in the matrix $\tilde{W}$ such
that $\|W - \tilde{W}(:,P)\|_1 \le \delta$ for some permutation $P$ and some $\delta \ge 0$. Since $\|W\|_1 = 1$, we will assume
that $\delta < 1$; otherwise the separable NMF problem is trivial since the solution $\tilde{W} = 0$ gives the result.
It actually makes sense to impose 5 < k < a < 1: this guarantees for a solution W to have distinct 
columns since two columns of W can potentially be at distance k; for example with 

/ § 
W= f 

extracting twice the first column would give the result with $\delta = \kappa$, which is not desirable. Moreover,
as shown in Section 1.1, it is necessary that $\epsilon < \frac{\alpha}{2} \le \kappa$ for any separable NMF algorithm to be
able to approximately extract the columns of $W$.

Theorem 4. For any < e < ^, it is necessary that 5 > (3^ + |e) for Proposition^ to hold. 

Proof. Let us consider M = M + N = WH + iV where 

( l 5~!\ /l 1-A \ / | \ 

W= 0l|-f ,il= 10 1 - A , and N = § , 
\ f / \ 1 A A / \ -§ / 
where W is a-robustly simplicial (and ^-robustly conical), and where A is such that the middle point
between M(:,4) and M(:,5) is (m(:, 3) + 2N(:, 3)V that is, 



$$M(:,3) + 2N(:,3) = \frac{1}{2}\left(M(:,4) + M(:,5)\right),$$



which requires A = 1 — 3^ > 0. Let p = (-K, —K, K 2 , — 1, 0) T for any K sufficiently large. It can be 
checked that 

/ 1 fi \ 

10 fi 

X = where fi 

0.5 0.5 0.5 -fi 

V 0.5 0.5 -fi 0.5 / 



1 - A 
2^A' 



is a feasible solution of ([2]). By Lemma [71 there exists K sufficiently large such that X*(3,3) = 
for any optimal solution X* . Using Lemma [7J again we have X*(l, 1) = X*(2,2) = 1 for any optimal 
solution X* for K sufficiently large. Hence, for K sufficiently large, the third column of M will not 
be extracted and the fourth or fifth will be, hence 

\\W- W\\ x = 11^0,3) -W(:,3)||i = ||M(:,4)-PF(:,3)||i = \\M(:, 5) - W(:, 3)||i 

= \\(1-\)W(:,1)-(1-\)W(:,3)\\ 1 

= S-\\W(:,1)-W(:,3)\\x 
a 



a \ 2 



3i + |, 
a 2 



□ 



Using the same construction^! as in Theorem 2] but taking A = 1 — ^ , we have 
M(:,3) = W(:,S)+N(:,3) = ± (M(:, 4) + M(:, 5)) , 



for which [(^(ijP) — W\\i > — + | for any permutation P, where W is the matrix extracted by 
HottTopixx. We notice that the corresponding matrix M can also be obtained from a 4-separable 
matrix M4 = W4H4 where 



Wa 



M(:,[12]) M(:,A)-v M(:,5)-v ) ,H 4 



( 1 \ 

10 

0.5 1 

\ 0.5 1 / 



v = ( e /4,e/4,-e/2) T , and 

= ( 3X 2 v v v ) , 
and we have M = WH + N = W4H4 + N 4 = M 4 . Therefore, 

No algorithm to which only the noisy separable matrix M and the noise level e are given 
as input can approximately extract the columns of W among the columns of M with error 
smaller than O (^) . 



4 A Matlab code is available at https://sites.google.com/site/nicolasgillis/code containing this construction, 
along with the one of Theorem [4] 
In fact, the matrix M above has two solutions to the noisy separable NMF problem and there is no
way to discriminate between them (the original matrix could be 3- or 4- separable): 

• If the algorithm returns a matrix W with three columns, then if the original matrix was M4 we 
have maxi<j<4imni< k <s\\W4(--,j) -W(:,k)\\i > |. 

• Similarly, if the algorithm returns a matrix W with four columns, then 

— if the third column is not extracted and the original matrix was M, we have 

max min |[W(:,j) — W(:, k)\\x > — , while 
i<?<3 i<fc<4 a 

— if the third column is extracted and the original matrix was M 4 , we have 

max . min \\W A (:,j) -W(:,k)\\i > -. 
i<j<4i<fc<4 a 

The reason is that the distance between each pair of columns of M is at least — . 

Note that the algorithm of Arora et al. pQ achieves this optimal bound O (— ). However, it 
requires the parameter a as an input so that the construction above does not prove their algo- 
rithm is optimal up to some constant multiplicative factor. In fact, for the 3-separable matrix M, 
W is a-robustly simplicial while, for the 4-separable matrix M4, W4 is a'-robustly simplicial with 
«'<2f = ||W 4 (:,3)-W 4 (:,4)|| 1 . 

3.2 Cluster Identification 

We now prove that there is a cluster of columns of $M$ around each column of $W$ for which the sum of
the corresponding diagonal entries of any feasible solution $X$ of (2) is large. More formally, defining
the clusters around the columns of $W$ as

$$\Omega_k^\rho = \{ j \mid \|M(:,j) - W(:,k)\|_1 \le \rho \}, \quad 1 \le k \le r, \qquad (8)$$

we are going to prove that $c_k = \sum_{j \in \Omega_k^\rho} X(j,j)$ is large for any feasible solution $X$ of (2), given that $\epsilon$
is sufficiently small.

Lemma 4. Let $W \in \mathbb{R}^{m \times r}_+$ have its columns sum to one, and let $h \in \Delta^r$. Then, denoting
$k = \mathrm{argmax}_{1 \le j \le r} h(j)$, we have

$$\|h\|_\infty = h(k) \ge 1 - \frac{\rho}{2} \quad \Rightarrow \quad \|W(:,k) - Wh\|_1 \le \rho.$$

Proof. Let us denote $\mathcal{R} = \{1, 2, \dots, r\} \setminus \{k\}$; we have

$$\begin{aligned}
\|W(:,k) - Wh\|_1 &= \|(1 - h(k))W(:,k) - W(:,\mathcal{R})h(\mathcal{R})\|_1 \\
&\le (1 - h(k))\|W(:,k)\|_1 + (\|h\|_1 - h(k))\|W(:,\mathcal{R})\|_1 \\
&\le 2(1 - h(k)) \le \rho.
\end{aligned}$$

□
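Lemma 4 is easy to sanity-check numerically. The sketch below is only an illustration with hypothetical data (the matrix `W`, the weight vector `h` and the value of `rho` are made up): it evaluates the premise $h(k) \ge 1 - \rho/2$ and the distance $\|W(:,k) - Wh\|_1$ for the dominant index $k$.

```python
def lemma4_check(W, h, rho):
    """Return (premise holds, ||W(:,k) - W h||_1) for k = argmax_j h(j)."""
    m, r = len(W), len(h)
    k = max(range(r), key=lambda j: h[j])
    Wh = [sum(W[i][j] * h[j] for j in range(r)) for i in range(m)]
    dist = sum(abs(W[i][k] - Wh[i]) for i in range(m))
    return h[k] >= 1 - rho / 2, dist


# Hypothetical W whose columns sum to one, and weights h in the unit simplex.
W = [[0.7, 0.2, 0.1],
     [0.2, 0.7, 0.1],
     [0.1, 0.1, 0.8]]
h = [0.92, 0.05, 0.03]
rho = 0.2

premise, dist = lemma4_check(W, h, rho)
```

Here the premise holds since $0.92 \ge 0.9$, and the computed distance stays below $2(1 - h(k)) = 0.16$, hence below $\rho$, as the proof predicts.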

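Lemma 5. Let $M = WH$ be a normalized $r$-separable matrix with $W$ $\kappa$-robustly conical. Let also
$\tilde{M} = M + N$ where $\|N\|_1 \le \epsilon < 1$, and $X$ be a feasible solution of (2). Then, the total weight
$c_k = \sum_{j \in \Omega_k^\rho} X(j,j)$ assigned to the columns of $M$ in $\Omega_k^\rho$ defined in (8) satisfies

$$c_k \ge 1 - \frac{4\epsilon}{\kappa\rho}\left(\frac{3-\epsilon}{1-\epsilon}\right) \quad \text{for all } 1 \le k \le r.$$

Proof. Let $1 \le k \le r$ and $\mathcal{R} = \{1, 2, \dots, r\} \setminus \{k\}$, and let us denote the indices corresponding to the
columns of $M$ not in $\Omega_k^\rho$ as

$$\bar{\Omega}_k^\rho = \{1, 2, \dots, n\} \setminus \Omega_k^\rho.$$

Let also $j$ be such that $W(:,k) = M(:,j)$. By Lemma 4, $\|H(k, \bar{\Omega}_k^\rho)\|_\infty \le 1 - \frac{\rho}{2} = \beta$. The rest of
the proof is similar to that of Lemma 3. By Lemma 2, $\|W(:,k) - WHX(:,j)\|_1 \le 2\epsilon\left(\frac{2-\epsilon}{1-\epsilon}\right)$ and
$\|X(:,j)\|_1 \le 1 + \frac{2\epsilon}{1-\epsilon}$. We have

$$\begin{aligned}
WHX(:,j) &= W(:,k)H(k,:)X(:,j) + W(:,\mathcal{R})H(\mathcal{R},:)X(:,j) \\
&= W(:,k)\left(H(k,\Omega_k^\rho)X(\Omega_k^\rho,j) + H(k,\bar{\Omega}_k^\rho)X(\bar{\Omega}_k^\rho,j)\right) + W(:,\mathcal{R})y,
\end{aligned}$$

where $y = H(\mathcal{R},:)X(:,j) \ge 0$, and

$$\begin{aligned}
\eta &= H(k,\Omega_k^\rho)X(\Omega_k^\rho,j) + H(k,\bar{\Omega}_k^\rho)X(\bar{\Omega}_k^\rho,j) \\
&\le \|X(\Omega_k^\rho,j)\|_1 + \beta\left(\|X(:,j)\|_1 - \|X(\Omega_k^\rho,j)\|_1\right) \le c_k + \beta\left(1 + \frac{2\epsilon}{1-\epsilon} - c_k\right).
\end{aligned}$$

The first inequality follows from $H(i,j) \le 1$ for all $i,j$ and $\|H(k,\bar{\Omega}_k^\rho)\|_\infty \le \beta$; the second from
$X(i,j) \le X(i,i)$ for all $i,j$ (hence $c_k \ge \|X(\Omega_k^\rho,j)\|_1$), and $\beta \le 1$. Finally,
$(1-\eta)\kappa \le \|W(:,k) - WHX(:,j)\|_1 \le 2\epsilon\left(\frac{2-\epsilon}{1-\epsilon}\right)$, leading to
$c_k = \sum_{j \in \Omega_k^\rho} X(j,j) \ge 1 - \frac{2\epsilon}{\kappa(1-\beta)}\left(\frac{3-\epsilon}{1-\epsilon}\right) = 1 - \frac{4\epsilon}{\kappa\rho}\left(\frac{3-\epsilon}{1-\epsilon}\right)$. □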

If we can guarantee that $c_k > \frac{r}{r+1}$ for all $1 \le k \le r$, then the sum of the diagonal entries of
$X$ corresponding to columns of $M$ not in any $\Omega_k^\rho$ will be smaller than $\frac{r}{r+1}$. Therefore, if instead of
picking the $r$ largest diagonal entries of $X$, we cluster the diagonal entries of $X$ depending on the
distances between the corresponding columns of $\tilde{M}$, we should be able to identify the columns of
$W$ approximately; see Algorithm 3.

Algorithm 3 Extracting Columns of a Separable Matrix by Linear Programming and Clustering

Input: An $r$-separable matrix $\tilde{M} = WH + N$ with $\|N\|_1 \le \epsilon \le \frac{\omega\kappa}{74(r+1)}$, where $W$ is $\kappa$-robustly conical.
Output: A matrix $\tilde{W}$ such that $\|\tilde{W}(:,P) - W\|_1 \le 37(r+1)\frac{\epsilon}{\kappa} + 2\epsilon$ for some permutation $P$.

1: Compute the optimal solution $X$ of (2).
2: Initialize $\mathcal{K} = \{k \mid X(k,k) > \frac{r}{r+1}\}$ and $\nu = 2\epsilon$.
3: while $|\mathcal{K}| < r$ do
4:    Compute $\mathcal{K}$ with Algorithm 4 using input $m_j = \tilde{M}(:,j)$ $1 \le j \le n$, $x = \mathrm{diag}(X)$ and $\nu$;
5:    $\nu \leftarrow 2\nu$;
6: end while
7: $\tilde{W} = \tilde{M}(:,\mathcal{K})$.

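Lemma 6. Let $m_j \in \mathbb{R}^m$ $1 \le j \le n$, $x \in \mathbb{R}^n_+$ be such that $\sum_{j=1}^n x_j = r$, and $\rho \ge 0$. Let also
$\Omega_k = \{m_j \mid \|m_j - w_k\|_1 \le \rho\}$ for $1 \le k \le r$ where $w_k \in \mathbb{R}^m$ $1 \le k \le r$. Suppose

• $\omega = \min_{i \ne j} \|w_i - w_j\|_1 \ge 6\rho$, and

• For all $1 \le k \le r$, there exists $1 \le j \le n$ such that $\|m_j - w_k\|_1 \le \epsilon \le \rho$.

Then, for any $(\rho + \epsilon) \le \nu < 2(\rho + \epsilon)$, Algorithm 4 identifies a set $\mathcal{K}$ with $r$ indices such that

$$\max_{1 \le k \le r} \min_{j \in \mathcal{K}} \|m_j - w_k\|_1 \le 3\rho + 2\epsilon. \qquad (9)$$

Moreover, if Algorithm 4 identifies a set $\mathcal{K}$ with $r$ indices for some $\nu < \rho + \epsilon$, then $\mathcal{K}$ satisfies (9).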
Algorithm 4 Cluster Extraction

Input: A set of points $m_j$, $1 \le j \le n$, a vector of weights $x \in \mathbb{R}^n_+$ such that $\sum_{j=1}^n x(j) = r$, and $\nu \ge 0$.
Output: An index set $\mathcal{K}$ of centroids corresponding to clusters with weight strictly larger than $\frac{r}{r+1}$.

1: $D(i,j) = \|m_i - m_j\|_1$ for $1 \le i, j \le n$;
2: $S_i = \{j \mid D(i,j) \le \nu\}$ for $1 \le i \le n$;
3: $w(i) = \sum_{j \in S_i} x(j)$ for $1 \le i \le n$;
4: $\mathcal{K} = \emptyset$;
5: while $\max_{1 \le i \le n} w(i) > \frac{r}{r+1}$ do
6:   $k = \mathrm{argmax}_i \, w(i)$;
7:   $\mathcal{K} \leftarrow \mathcal{K} \cup \{k\}$;
8:   For all $j \in S_k$: $w(j) \leftarrow 0$;
9:   For all $i \notin S_k$ and $j \in S_k$ such that $j \in S_i$: $w(i) \leftarrow w(i) - x(j)$;
10: end while


Proof. First notice that the index set $\mathcal{K}$ extracted by Algorithm 4 cannot contain more than $r$ indices.
In fact, Algorithm 4 only identifies clusters with weight strictly larger than $\frac{r}{r+1}$ while the total weight
$\sum_{j=1}^n x(j)$ is equal to $r$. It remains to show that $\mathcal{K}$ contains at least $r$ indices.

Let us first consider the case $\rho + \epsilon \le \nu \le 2(\rho + \epsilon)$. Let $S_i$, $1 \le i \le n$, be the sets computed by
Algorithm 4 before entering the while loop. We observe that

• For $m_j \in \Omega_k$ and $m_{j'} \in \Omega_{k'}$ where $j \ne j'$ and $k \ne k'$, we have $m_j \notin S_{j'}$ and $m_{j'} \notin S_j$. In fact,

$\|m_j - m_{j'}\|_1 = \|(m_j - w_k) + (w_k - w_{k'}) + (w_{k'} - m_{j'})\|_1 \ge \omega - 2\rho > 4\rho \ge \nu$.

• For all $1 \le k \le r$, there exists $m_j \in \Omega_k$ such that $w(j) > \frac{r}{r+1}$. By assumption, for all
$1 \le k \le r$, there exists $m_j \in \Omega_k$ such that $\|m_j - w_k\|_1 \le \epsilon$, hence for all $m_i \in \Omega_k$ we have

$\|m_j - m_i\|_1 = \|(m_j - w_k) + (w_k - m_i)\|_1 \le \rho + \epsilon \le \nu$ while $\sum_{j \in \Omega_k} x(j) > \frac{r}{r+1}$.

• If $m_i \notin \cup_{1 \le k \le r} \Omega_k$ and $w(i) > \frac{r}{r+1}$, then $\|m_i - w_k\|_1 \le 3\rho + 2\epsilon$ for some $1 \le k \le r$. Suppose
$\|m_i - w_k\|_1 > 3\rho + 2\epsilon$ for all $k$, then

$\|m_i - m_j\|_1 \ge \|(m_i - w_k) + (w_k - m_j)\|_1 > 3\rho + 2\epsilon - \rho \ge \nu$ for all $m_j \in \cup_{1 \le k \le r} \Omega_k$.

Therefore, $\sum_{j \in S_i} x(j) \le r - \sum_k \sum_{j \in \Omega_k} x(j) < r - r\frac{r}{r+1} = \frac{r}{r+1}$, a contradiction.

Let then $k$ be such that $\|m_i - w_k\|_1 \le 3\rho + 2\epsilon$. This implies that if $m_j \in S_i$, then either $m_j \in \Omega_k$,
or $m_j \notin \cup_{k' \ne k} \Omega_{k'}$. In fact, if $m_j \in \Omega_{k'}$ for some $k' \ne k$, then

$\|m_i - m_j\|_1 \ge \|(m_i - w_k) + (w_k - w_{k'}) + (w_{k'} - m_j)\|_1 \ge \omega - 3\rho - 2\epsilon - \rho > 2\rho - 2\epsilon \ge \nu$,

a contradiction.

These observations imply that there are at least $r$ disjoint sets $S_i$ with weight larger than $\frac{r}{r+1}$, each
corresponding to a different cluster $\Omega_k$. Therefore, Algorithm 4 will identify them individually and
(9) will be satisfied.

For the case $\nu < \rho + \epsilon$, the result follows directly from the observations above: any point $m_i$ with
$w(i) > \frac{r}{r+1}$ must satisfy $\|m_i - w_k\|_1 \le 3\rho + 2\epsilon$ for some $1 \le k \le r$. Moreover, for all $k$ there must exist
$j \in \mathcal{K}$ such that $\|m_j - w_k\|_1 \le 3\rho + 2\epsilon$. In fact, suppose there exists $k$ such that $\|m_j - w_k\|_1 > 3\rho + 2\epsilon$
for all $j \in \mathcal{K}$. Then, $m_i \notin \cup_{j \in \mathcal{K}} S_j$ for all $i \in \Omega_k$ (see above), hence $\sum_{i \in S_j,\, j \in \mathcal{K}} x(i) < r - \frac{r}{r+1} = r\frac{r}{r+1}$,
which implies that $\mathcal{K}$ cannot contain more than $r - 1$ indices, a contradiction. □
Theorem 5. Let $M = WH$ be a normalized r-separable matrix with $W$ $\kappa$-robustly conical. Let also
$\tilde M = M + N$ with $\|N\|_1 \le \epsilon$. If

$\epsilon < \frac{\omega\kappa}{74(r+1)}$,

where $\omega = \min_{i \ne j} \|W(:,i) - W(:,j)\|_1$, then Algorithm 3 will extract a matrix $\tilde W$ such that

$\|\tilde W - W(:,P)\|_1 \le \delta = 37(r+1)\frac{\epsilon}{\kappa} + 2\epsilon$, for some permutation $P$.

Proof. Let $X$ be a feasible solution of (2), let the $r$ clusters $\Omega_k^\rho$, $1 \le k \le r$, be defined as in Equation (8)
and let $c_k = \sum_{j \in \Omega_k^\rho} X(j,j)$. If $\rho < \frac{\omega}{6}$ and $c_k > \frac{r}{r+1}$, then, by Lemma 6, Algorithm 4 will identify a
set $\mathcal{K}$ with $r$ indices such that

$\max_{1 \le k \le r} \min_{j \in \mathcal{K}} \|W(:,k) - \tilde M(:,j)\|_1 \le \delta = 3\rho + 2\epsilon$,

for any $\nu \in [\rho + \epsilon, 2\rho + 2\epsilon]$. Therefore, starting with $\nu = 2\epsilon \le \rho + \epsilon$ and multiplying it by two
at each iteration will eventually give a value of $\nu$ in $[\rho + \epsilon, 2\rho + 2\epsilon]$. (Note that Algorithm 4 could
return a set $\mathcal{K}$ with $r$ indices for $\nu$ smaller than $\rho + \epsilon$, see Lemma 6. Note also that the number of
iterations performed by Algorithm 3 is at most of order $\log_2\left(\frac{\rho+\epsilon}{\epsilon}\right)$.) If $\epsilon = 0$, then $c_k = 1$ for all $1 \le k \le r$
while $\rho = 0 < \frac{\omega}{6}$, and the loop is entered at most once (if the entries of $p$ are distinct, then it is not
entered because exactly $r$ diagonal entries of an optimal solution of (2) will be equal to one, each
corresponding to a different column of $W$ [2, Prop. 3.1]). Otherwise $\epsilon > 0$ and it remains to guarantee
that $\rho < \frac{\omega}{6}$ and $c_k > \frac{r}{r+1}$. By Lemma 5,

$\epsilon \left(\frac{3-\epsilon}{1-\epsilon}\right) < \frac{\rho\kappa}{4(r+1)} \;\Rightarrow\; c_k > \frac{r}{r+1}$.

Taking $\epsilon < \frac{\omega\kappa}{74(r+1)}$ and $\rho = \frac{37}{3}(r+1)\frac{\epsilon}{\kappa} < \frac{\omega}{6}$ completes the proof since

$\rho = \frac{37}{3}(r+1)\frac{\epsilon}{\kappa} > 4(r+1)\frac{\epsilon}{\kappa}\left(\frac{3-\epsilon}{1-\epsilon}\right)$

because $3 < \frac{3-\epsilon}{1-\epsilon} < \frac{37}{12}$ for any $0 < \epsilon < 10^{-2}$ (as $74(r+1) > 100$ for $r \ge 1$). □
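
As a sanity check on the constants (a worked restatement of the choices made in the proof, not an additional result), reading the proof with $\rho = \frac{37}{3}(r+1)\frac{\epsilon}{\kappa}$ ties the two conditions and the final bound together as follows:

```latex
\begin{align*}
\rho < \frac{\omega}{6}
  &\iff \frac{37}{3}(r+1)\frac{\epsilon}{\kappa} < \frac{\omega}{6}
   \iff \epsilon < \frac{3\,\omega\kappa}{6 \cdot 37\,(r+1)} = \frac{\omega\kappa}{74(r+1)}, \\
\delta = 3\rho + 2\epsilon
  &= 37(r+1)\frac{\epsilon}{\kappa} + 2\epsilon, \\
\frac{3-\epsilon}{1-\epsilon} < \frac{37}{12}
  &\implies 4(r+1)\frac{\epsilon}{\kappa}\left(\frac{3-\epsilon}{1-\epsilon}\right)
   < \frac{37}{3}(r+1)\frac{\epsilon}{\kappa} = \rho .
\end{align*}
```

The last line is the condition of Lemma 5 rearranged, so the single choice of $\rho$ yields both $\rho < \frac{\omega}{6}$ and $c_k > \frac{r}{r+1}$.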

It can be checked that all the results from Section 2 apply to Algorithm 3. Indeed, by assumption,
none of the matrices considered there contained duplicates or near-duplicates of the columns of matrix $W$,
and we showed that $r$ diagonal entries of $X$ carry enough weight that Algorithm 3
will not enter the while loop, hence it is equivalent to Algorithm 1. In particular, Corollary 1 also
applies to Algorithm 3, that is, it is necessary that

e < , for any o < — . 

r — 1 2 

This shows that the bound of Theorem 5 for $\epsilon$ is tight up to a factor $\omega$ (and some constant multiplicative
factor). Moreover, by Theorem 01 the bound for 5 is tight up to a factor r (and some constant 
multiplicative factor) . 

Remark 3 (Computational Cost). The main additional cost of Algorithm 3 compared to Algorithm 1
is computing and storing the distance matrix $D$. This requires $O(mn^2)$ floating point operations
and $O(n^2)$ space in memory. This is negligible as computing $\tilde M X$ already requires $O(mn^2)$ operations,
while storing $X$ requires $O(n^2)$ space in memory. Notice that if $\mathrm{diag}(X)$ contains zero entries, they
can be discarded along with the corresponding data points.
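
The pruning mentioned in this remark can be sketched as follows, again in plain Python and only as an illustration. The function name is hypothetical; it drops the points whose diagonal weight is zero and then fills the symmetric $\ell_1$ distance matrix on the remaining points, which costs on the order of $m n'^2$ operations for the $n'$ points that are kept.

```python
def prune_and_pairwise_l1(points, x):
    """Drop zero-weight points, then build the l1 distance matrix D.

    Returns the list of kept original indices and D as a list of lists.
    """
    keep = [j for j, weight in enumerate(x) if weight > 0.0]
    kept = [points[j] for j in keep]
    n = len(kept)
    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        # D is symmetric with a zero diagonal, so only pairs a < b are computed.
        for b in range(a + 1, n):
            d = sum(abs(u - v) for u, v in zip(kept[a], kept[b]))
            dist[a][b] = d
            dist[b][a] = d
    return keep, dist
```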
Remark 4 (Choice of the vector p). Because of the post-processing procedure in Algorithm 3, it is
not necessary for Theorem 5 to hold that the vector $p$ has distinct entries. However, it will still be
useful in practice to impose this condition. In fact, this encourages the weights to concentrate in
fewer diagonal entries of $X$ so that typically fewer iterations of the loop are needed to obtain a set $\mathcal{K}$
containing $r$ indices. In particular, in the exact case (that is, $\epsilon = 0$) or when there are no duplicates
or near-duplicates in the dataset (see above), the loop will not be entered.

Remark 5 (More Sophisticated Post-processing Strategies). It is possible to design better post-processing
procedures but we wanted here to keep the analysis simple. In particular, if the input matrix
$\tilde M$ does not satisfy the conditions of Theorem 5, it may happen that no set $\mathcal{K}$ computed in the loop of
Algorithm 3 contains $r$ elements. Therefore, one should keep in memory the largest set extracted so far,
or design more sophisticated strategies. For example, if fewer than $r$ clusters have been extracted, the
condition that the weight of each extracted cluster must be larger than $\frac{r}{r+1}$ can be relaxed; this variant has been
implemented in the Matlab code available at https://sites.google.com/site/nicolasgillis/code.



3.3 Repartition of the Weights inside a Cluster

In this last section, we show that HottTopixx (Algorithm 1) cannot provide better bounds than
Algorithm 3. The reason is the following: inside a cluster $\Omega_k^\rho$, there is no guarantee that all the
weight will be assigned to a single diagonal entry of $X$ (as originally claimed in [3]). In the proof of
Theorem 6, we show that the weight may be equally distributed inside a cluster. This construction
allows us to show that $\epsilon < \frac{\kappa}{(r-1)^2}$ is necessary for Proposition 2 to hold for any $\delta < \kappa + \epsilon$, which proves
our claim.

Theorem 6. For any $r \ge 3$ and $\delta < \kappa + \epsilon$, it is necessary for Proposition 2 to hold that

$\epsilon < \frac{\kappa}{(r-1)^2}$.

Proof. See Appendix C. □
A Matlab code is available at https://sites.google.com/site/nicolasgillis/code containing



the constructions from Theorems O 0] and El and the post-processing procedure (Algorithm [3|) . In 
particular, the construction from Theorem [S] provides examples where Algorithm [3] is more robust 
than HottTopixx. 



4 Conclusion and Further Work 

In this paper, we proposed a more robust variant of HottTopixx based on an appropriate post-processing
of the solution of the linear program (2) (see Algorithm 3). We proved that Algorithm 3 is
robust for any input separable matrix $\tilde M$ (Theorem 5), while our analysis is close to being tight.

It would be interesting to improve the bound of Theorem 5 or show that the bound is tight. It
would also be particularly interesting to design separable NMF algorithms that are more robust,
computationally more efficient, or both.

Appendix A. Proof of Theorem 1

Proof of Theorem 1. Let

k = argmin 1<1<r min ||W(:,j) — W(:, ^/)x\U, where J = {1, 2, ... ,r}\{j\, 
$w = W(:,k)$, and



y = aigmin x6R r-i \\W{:, k) - W(:,H)x\\u where U = {1,2, .. . ,r}\{k}, 

so that, by definition, \\y — = k. If y = 0, we are done since k = \\w\\± = 1 > \a as a < 2. 
Otherwise y / and we define z = Tnjjrn = A -1 ?/. By definition, — z||i > a since z belongs to the 
convex hull of the columns of W. We have 

$\alpha \le \|w - z\|_1 = \|w - (\lambda + 1 - \lambda)z\|_1 \le \|w - \lambda z\|_1 + (1 - \lambda)\|z\|_1 = \kappa + (1 - \lambda) \le 2\kappa$

since $\kappa = \|w - \lambda z\|_1 \ge \|w\|_1 - \|\lambda z\|_1 = 1 - \lambda$, and the proof is complete. □



Appendix B. Proof of Theorem [3] 

The following lemma shows that if one of the coefficients in the objective function of a linear program 
is much larger than all the other ones, then the corresponding entry of any optimal solution must be 
smaller than the corresponding entry of any feasible solution. Although the result is clear intuitively, 
we provide here a simple proof. 

Lemma 7. Consider the following linear program

$\min_x \; c_K^T x$ such that $Ax = b$ and $l \le x \le u$, (10)

with $l, u \in \mathbb{R}^n$, $l \le u$, and $c_K = (K, c) \in \mathbb{R}^n$ where $K \in \mathbb{R}$ is a parameter. Let us denote by $x_K^*$ an
optimal solution of (10) depending on $K$. Assume there exists a feasible solution $x^s$ of (10) such that
$x^s(1) = s$. Then, for any $K$ sufficiently large, $x_K^*(1) \le s$.

Similarly, if $c_K(1) = -K$ and there exists a feasible solution such that $x(1) = t$, then, for any $K$
sufficiently large, $x_K^*(1) \ge t$.

Proof. Let $\mathcal{V}$ be the set of vertices of the feasible set of (10), and $\bar{\mathcal{V}} = \{x \in \mathcal{V} \mid x(1) > s\}$. Notice
that because the feasible set of (10) is a polytope, there always exists an optimal solution in $\mathcal{V}$. Let
us denote $d = \min_{x \in \bar{\mathcal{V}}} x(1) > s$. Assume there exists an optimal solution $x_K^*$ such that $x_K^*(1) > s$.
This implies that there exists an optimal solution $\bar{x}_K^* \in \bar{\mathcal{V}}$ (since any optimal solution is a convex
combination of optimal vertices in $\mathcal{V}$). Therefore,

$Kd - \|c\|_2\|u\|_2 \le c_K^T \bar{x}_K^* = c_K^T x_K^* = K x_K^*(1) + c^T x_K^*(2{:}n) \le c_K^T x^s \le Ks + \|c\|_2\|u\|_2$,

which is absurd for any $K > \frac{2\|c\|_2\|u\|_2}{d - s}$. □

The linear program (2) can be written in the form of (10): in fact, $0 \le X \le 1$ while the $mn$
additional variables necessary to express the constraint $\|\tilde M - \tilde M X\|_1 \le 2\epsilon$ linearly will be in the
interval $[0, 2\epsilon]$. Therefore, Lemma 7 applies to (2).

Proof of Theorem 3. We prove the result with the following construction: Let

V c 1 - § y 

which is K-robustly conical. Let also 

$H = \begin{pmatrix} I_r & \beta I_r + \frac{1-\beta}{r-1}(E_r - I_r) \end{pmatrix}$

so that $\|H'\|_\infty = \beta$ (note that $\beta$ must be larger than $\frac{1}{r}$ since the columns of $H'$ sum to one), $N = 0$,
$\tilde M = WH + N$, $p = (1, 2, 3, \dots, r-1, -K, -1, -2, \dots, -(r-1), -K^2)^T$ for $K$ sufficiently large, and

$\epsilon = \frac{\kappa(1-\beta)}{(r-1)(1-\beta)+1} \le \frac{\kappa}{r-1}$.

Assume that 

( (1 - 5)J r _! + £ET (#r-l - ir-l) (l-^Vi + ^^i-I^i)^-/?)) 0^ 

^f(H-/3)e 



where 



V 



r— 1 





1 













i y 



$\delta = (2 - \beta)\omega$ and $\omega = \frac{\epsilon}{\kappa(1-\beta)}$, implying $X(n,n) = \omega + (r-1)(\delta - \omega) = 1$,



is a feasible solution of (2) (note that $n = 2r$). By Lemma 7, there exists $K$ sufficiently large such that
any optimal solution $X^*$ must satisfy $X^*(n,n) = 1$. Using Lemma 7 again, there exists $K$ sufficiently
large such that $X^*(r,r) = 1$. Therefore, for $K$ sufficiently large, the $r$th and $n$th columns of $\tilde M$ will be
extracted, implying

\\W - W(-,P)\\i = , ™> , ~ M(:,n)||! = \\W(:, 1) - M(:,n)||i = > ^« > e, 

l<?<r-l r — 1 r — 1 

and the proof will be complete. 

It remains to show that $X$ is feasible: Clearly, $\mathrm{tr}(X) = r$. For the constraints $0 \le X \le 1$, we check
that

e 1 ..,_„, (l-/3) + l 



< w 



and 



k(1-/3) r(l-/3) + l 
1 /l - w 



< 5 = (2 - /3)w 



r(l-/3) + l 



< 1, 



< 



r-1 ^1-5 

For X(i,j) < X(i,i) for all i,j, we only have to check that 

5 — uj 



i 1 — uj r — 1 

/3 < 1 since = > 1. 

l — o r — 2 



1-6 > 



r — 1 



(r-l)(r-2)(l-/?)>(l-/3). 



It remains to verify that $\|\tilde M(:,j) - \tilde M X(:,j)\|_1 \le 2\epsilon$ for all $1 \le j \le 2r$:
• $1 \le j \le r - 1$. We have



||M(:,j)-MX(:,j)||i 



M(:, j) - (1 - «y)M(:, j) - -^—M(:,j+r] 

r — 1 

<5M(:, j) - wM(:, j +r) - -VrM(:, J)e 

r — 1 

w \\M(:,j) - M(:,j + r)^ + (5 - uj) M(:,j) 

uik(1 - P) + (6 - uj)k 
2ujk{1 -P) = 2e, 



where J = {1,2,..., r}\{j}. 
• $r + 1 \le j \le 2r - 1$. We have

||M(:,i) - MX{:,j)\\ x = \\M(:,j) - uM(:,j) - (1 - 6)/3W(:,j - r) - (1 - u - 0(1 - 5)) ^ 

= (!-«) 



M(:,i) - ^4/3M(:,j - r) - f 1 - 0^-4 I <r, 



r-Y 



r — 1 



(1- W ) 



M(:,i) -11- 01^:, j - r ) - (l - + ' 



r - 1 



r - 1 



0(1-0; 



— \\W(:,j -r) -w^ 



0(1 — uj)k (1 — uj)k 



(r-l)(l-/3)« 



< 



r - 1 
(1-0)k 



< 



r ((r-l)(l-0) + l) " (r-l)(l-0) + l ' 

where ^ = W(:,K)^ and ft = {1, 2, . . . , r}\{j - r}. In fact, ±% = ^f, < ± and, by 
construction, M(:,j) = (3W(:,j — r) + (1 — 0)iUj. 

□ 



Appendix C. Proof of Theorem [6] 

Proof of Theorem 6. We prove the result with the following construction: Let



W= (1 



H T 

2 r 



2> 
rX r 



k\ g t 



if 



r-l 1 



1 (l-A)e J 



where A = 2-, 



N 



0(r+l^xr 0(r+l)xl 0(r+l)x(r-l) 0(r+l)xl 



0(r-l)x(r-l) 0(r-l)xl 



Olx(r-l) 



z 







(r-l)xl 



where Z = xI r -\ + y(E r ^i — I r _i) with x = ^rrj-e and y = —§>■ The matrix Z has been constructed 
so that ||Z(:,j)||i < e for all j, £\ ^(«,j) = for all i, and M(:,j) - 7 ^ T M(:,l)e = 2e for all 
j e J = { r + 1, r + 2, . . . 2r - 1} and 1 = J\{i}. Let also M = WH + N, 



< € < 



(r - l) 2 ~ 2(r - 1) 



so that A < 



r - 1 



and 



p = (1, 2, . . . , r - 1, K\ K 2 , K 2 + 1, . . . , K 2 + r - 1, -iff, 
for if sufficiently large. Assume 



X 





-J-re ^-r^r-i 

r— 1 r— 1 ' 1 



2e e T 











is feasible for ^fy. Letting X* be any optimal solution, by Lemma there exists K sufficiently 
large such that X*(r,r) = 0. By Lemma [8] (see below), this implies that X*(j,j) > -Kr for j € J . 
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Using Lemma [7] again, we have that for K sufficiently large X*(j,j) = -jk- f° r all f° r 3 € J, and 
X*(n,n) > (r — Therefore, since 



2e(r - 1) 1 
— > 



e > 



A 1 26 1 

and 1 > 

k r — 1 



r — 2\ k k 

f<l 7=i 2-1' 



« r-1 ■ ■ 2(r-l) 2 

the first r — 1 columns and the last column of M will be extracted so that 

= ||W(:,r)-M(:,n)||i = k + e, 

and the proof will be complete. 

It remains to show that X is feasible. We clearly have tr(X) = r, < X < 1, and X(i,j) < X(i, i) 
for all i, j because e < 2(r-i) wnue ) f° r ~~ < 2e for all j, we have 

• 1 < J < r — 1. 

\\M(:,j) - MX(:, j)||i = -\\M(:,j) - M(:,n)\\x = 2^e < 2e. 

K r — 1 

• $j = r$. This follows from Lemma 8.

• $r + 1 \le j \le 2r - 1$. This follows from the construction of matrix $Z$.
. j = 2r. M(:,j) = MX(:,j) since M(:,j) = ^W(:, l:r - l)e. 

□ 

Lemma 8. Let $W$, $H$, $N$ and $\tilde M = WH + N$ be the matrices constructed in the proof of Theorem 6. Let also
$\mathcal{R} = \{1, 2, \dots, n\} \setminus \{r\}$. Then

$\min_{x \ge 0} \|\tilde M(:,r) - \tilde M(:,\mathcal{R})x\|_1 = 2\epsilon$, (11)



and the unique optimal solution of (fTT|) is given by 



x' 



0(r-l)xl 

Olxl 



D 2r-1 



Proof. Let j;* = (y,z,w) be an optimal solution of (jlip where y,z £ and u; £ K+. We have to 

show that x* = x>. From x* , let us construct another optimal solution x' = (y',z',0) such that all 
the entries of y' and z' are equal to each other. Because M(:,n) = ^ryM(:, l:r — l)e, we take w = 0, 
replace z z + -^j and obtain an equivalent solution. Let us denote TZ = 1Z\{n} and 



M(:,r) -M(:,K) 



By symmetry, one can check that g(y, z) = g(y(P), z(P)) for any permutation P of {1, 2, . . . , r — 1} 
(this simply amounts to permuting the first and last r — 1 columns of M(:,TZ)). By convexity, (y f , z') = 
jjjT X^Pen(y(-f)' z (P))i where II is the set of all possible permutations of {1, 2, . . . , r — 1}, is also an 
optimal solution of (jlip hence all entries of y' and z 1 are equal to each other, and ||y'||i = \\y\\\ and 

IMIi = Mli- 

Therefore, denoting y'(i) = and z'(i) = — for all 1 < i < r — 1, the optimization problem 
(fTTl) can be reduced to 



mm 

a,b>0 



( °(r— l)xl \ 



\ 0(r-l)xl / 



1 - - 
1 2 



V °(r-l)xl / 



2(r-l) 



1-A)f 
1-f 


V 0( r _!) xl / 



(12) 
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= min h(a, b) = J \a + Xb\ + ~ |1 - (1 - X)b\ + (l - ~) |1 - a - b\ + e \a\ . 

a,b>0 2 2 V 2 J 

Let us show that (a*, b*) = (0,1) is the unique optimal solution, for which h(0, 1) = kX = 2e. For 
a + b > 1, the subdifferential of h in a is 1 — e > 0, while, for a + 6 < 1, the subdifferential of /t in 6 is 
kA — 1 < hence a* + b* = 1 at optimality. Substituting a = 1 — b above, we obtain 

b* = argmino^-L 2|1 - (1 - X)b\ = 1, 

which is unique as the slope at b = 1 is positive. 

Finally, we have b* = 1 , a* = is the unique solution of fjl2[) implying that y' = y = and that the 
minimal objective function value of (jlip is 2e. Moreover, this implies ||z'||i = ||-z||i = 1- It remains 
to show that the entries of z are equal to each other, that is, show that the unique solution to the 
following system 





( 0( r _i) xl ^ 








1 


K 


l-l 








V 0(r-l)xl / 









= 2e, 



is z* = ^rrfe, which is clearly the case as the only z such that Zz = and ||z||i = 1 is z* . This 
completes the proof. □ 